PSC: Parallel Spectral Clustering

Authors

  • Wen-Yen Chen
  • Yangqiu Song
  • Hongjie Bai
  • Chih-Jen Lin
  • Edward Y. Chang
Abstract

The spectral clustering algorithm has been shown to be more effective in finding clusters than some traditional algorithms such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix: sparsifying the matrix, and the Nyström method. We compare the two, then pick the strategy of sparsifying the matrix by retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a large document data set of 193,844 instances and a large photo data set of 2,121,863 instances, we demonstrate that our parallel algorithm can effectively alleviate the scalability problem.

Note to reviewers: A preliminary version of this work appears in the proceedings of ECML 2008 [29]. This journal version makes three major enhancements over the ECML version. First, we have improved PSC, and this paper provides implementation details for the Message Passing Interface (MPI) and the MapReduce framework. Second, we apply PSC to a much larger data set of 2,121,863 data instances and report more results. Third, we discuss and compare the Nyström method for spectral clustering.
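
As an illustration of the sparsification strategy named in the abstract (retaining nearest neighbors), the following is a minimal single-machine sketch in Python with NumPy. It is not the authors' PSC implementation, which distributes memory and computation with MPI and MapReduce. The function names, the Gaussian similarity, the parameters t and sigma, the normalized-Laplacian formulation, and the simple k-means step are all assumptions made for this example.

```python
import numpy as np


def knn_sparsified_similarity(X, t, sigma=1.0):
    """Gaussian similarity matrix keeping only each point's t nearest neighbors.

    A dense array is used here for brevity; a large-scale version would
    store only the retained entries in a sparse format.
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped at zero for round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :t]  # indices of the t nearest neighbors
    rows = np.repeat(np.arange(n), t)
    cols = nn.ravel()
    S = np.zeros((n, n))
    S[rows, cols] = np.exp(-d2[rows, cols] / (2.0 * sigma ** 2))
    # Symmetrize: keep an entry if either point lists the other as a neighbor.
    return np.maximum(S, S.T)


def spectral_embedding(S, k):
    """Rows of the k smallest eigenvectors of the normalized Laplacian."""
    deg = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(S.shape[0]) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    V = vecs[:, :k]
    # Normalize each row to unit length before clustering.
    return V / np.linalg.norm(V, axis=1, keepdims=True)


def kmeans(V, k, iters=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [V[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((V - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(dist)])
    centers = np.array(centers)
    labels = np.zeros(V.shape[0], dtype=int)
    for _ in range(iters):
        dist = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center unchanged
                centers[j] = V[labels == j].mean(axis=0)
    return labels


def spectral_cluster(X, k, t, sigma=1.0):
    """Cluster the rows of X into k groups using a t-nearest-neighbor graph."""
    S = knn_sparsified_similarity(X, t, sigma)
    return kmeans(spectral_embedding(S, k), k)
```

Retaining only t neighbors per point reduces the number of stored similarities from n² to roughly n·t, which is the memory saving the abstract refers to. The dense eigendecomposition used here would not scale to the data set sizes in the abstract and stands in for a sparse, distributed eigensolver.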

Similar resources

Parallel Spectral Clustering Algorithm Based on Hadoop

Spectral clustering and cloud computing are emerging branches of computer science and related disciplines. Spectral clustering overcomes the shortcomings of some traditional clustering algorithms and guarantees convergence to the optimal solution, and has thus attracted widespread attention. This article first introduces the research background and significance of the parallel spectral clustering algorithm, and then to Hadoop t...

Parallel Spectral Clustering Algorithm for Large-Scale Community Data Mining

The spectral clustering algorithm has been shown to be very effective in finding clusters of non-linear boundaries. Unfortunately, spectral clustering suffers from the scalability problem in both memory use and computational time. In this work, we parallelize the algorithm by dividing both memory use and computation on distributed machines. Empirical study on some small datasets shows the accur...

Parallel Spectral Clustering

Spectral clustering algorithm has been shown to be more effective in finding clusters than most traditional algorithms. However, spectral clustering suffers from a scalability problem in both memory use and computational time when a dataset size is large. To perform clustering on large datasets, we propose to parallelize both memory use and computation on distributed computers. Through an empir...

Magnetic Activities in Outer Atmosphere of the RS CVn-type Binary SZ Psc

Abstract. We present the results of time-resolved high-resolution spectroscopic observations of the very active RS CVn-type star SZ Psc, obtained during two consecutive observing nights in 2011 October. Chromospheric activity indicators (including the Hα, Na i D1, D2, He i D3, and Hβ lines) formed at different atmospheric heights were analyzed using the spectral subtraction technique, which sho...

Segmentation of cDNA Microarray Images using Parallel Spectral Clustering

Microarray technology generates large amounts of gene expression-level data to be analyzed simultaneously. This analysis requires microarray image segmentation to extract the quantitative information from spots. Spectral clustering is one of the most relevant unsupervised methods able to group data without a priori information on shapes or locality. We propose and test on micro...

Publication date: 2008